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1.  Introduction 

In  this  work,  we  consider  the  problem  of  hnding  the  distribution  of 

max  ?7^e,  e~V(0, 6),  (1) 

rjeK 

for  a  convex  set  /C  C  In  other  words,  we  study  a  Gaussian  process  with  a 
finite  Karhunen-Loeve  expansion  (Adler  &  Taylor  2007),  restricted  to  a  convex 
set  in  RP. 

While  this  is  a  well-studied  topic  in  the  literature  of  Gaussian  processes,  our 
aim  here  is  to  describe  an  implicit  formula  for  both  the  distribution  of  (I),  as 
well  as  the  almost  surely  unique  maximizer 

rj*  =  argmax  rf" e.  (2) 

veK 
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A  main  point  of  motivation  underlying  our  work  is  the  application  of  such  a 
formula  for  inference  in  modern  statistical  estimation  problems.  We  note  that  a 
similar  (albeit  simpler)  formula  has  proven  useful  in  problems  related  to  sparse 
regression  (Lockhart  et  al.  2013,  Taylor  2013).  Though  the  general  setting  con¬ 
sidered  in  this  paper  is  ultimately  much  more  broad,  we  begin  by  discussing  the 
sparse  regression  case. 

1.1.  Example:  the  lasso 

As  a  preview,  consider  the  penalized  regression  problem,  i.e.,  the  lasso  prob¬ 
lem  (Tibshirani  1996),  of  the  form 

P  =  a,Tgmmh\y-Xl3\\l  +  X\\l3\\i,  (3) 

(SgRp  ^ 

where  y  G  M",  X  G  and  A  >  0.  Very  mild  conditions  on  the  predictor 

matrix  X  ensure  uniqueness  of  the  lasso  solution  j3,  see,  e.g.,  Tibshirani  (2013). 
Treating  X  as  fixed,  we  assume  that  the  outcome  y  satisfies 

y-iV(A/?o,S),  (4) 

where  /3o  G  is  some  fixed  true  (unknown)  coefficient  vector,  and  S  e  is 

a  known  covariance  matrix.  In  the  following  sections,  we  derive  a  formula  that 
enables  a  test  of  a  global  null  hypothesis  Hq  in  a  general  regularized  regression 
setting.  Our  main  result,  Theorem  1,  can  be  applied  to  the  lasso  problem  in 
order  to  test  the  null  hypothesis  Hq  :  (3q  =  0.  This  test  involves  the  quantity 

Al  =  llA^ylloo, 

which  can  be  seen  as  the  Hrst  knot  (i.e.,  critical  value)  in  the  lasso  solution  path 
over  the  regularization  parameter  A  (Efron  et  al.  2004).  Recalling  the  duality  of 
the  (.1  and  £oo  norms,  we  can  rewrite  this  quantity  as 

Al  =  max  r7^(A^j/),  (5) 

rj£K 

where  JC  =  {y  ■.  ||?7|ji  <  1},  showing  that  Ai  is  of  the  form  (1),  with  e  =  X'^y 
(which  has  mean  zero  under  the  null  hypothesis).  Assuming  uniqueness  of  the 
entries  of  X'^y,  the  maximizer  rj*  in  (5)  is 

,  rsign(Ajy)  \Xjy\  =  \\X^yU  .  , 

0  othei-wise  '  J  = 
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Let  j*  denote  the  maximizing  index,  so  that  \Xj,y\  =  and  also  s*  = 

sign{Xj,y),  Qjk  =  XjT,Xk.  To  express  our  test  statistic,  we  define 


v-.= 


sG{  — 1 

i*  >0 


V  .  = 


mm  - - - - - - 

SG{  — 1  50j*/i;/0j*j* 

l  —  sQj*k/Oj*j*  <0 


Then  under  Hq  :  /3o  =  0,  we  prove  that 

i.(v,t/e;;,i)  -  »i-(Ai/9j4i) 
<i.(v,t/e‘4l)-»{v-./e‘4l) 


Unif(0, 1), 


(6) 


where  <I>  is  the  standard  normal  cumulative  distribution  function.  This  formula 
is  somewhat  remarkable,  in  that  it  is  exact — not  asymptotic  in  n,p — and  relies 
only  on  the  assumption  of  normality  for  y  in  (4)  (with  essentially  no  real  re¬ 
strictions  on  the  predictor  matrix  X).  As  mentioned  above,  it  is  a  special  case 
of  Theorem  1,  the  main  result  of  this  paper. 

The  above  test  statistic  (6),  which  we  refer  to  as  the  Kac-Rice  test  for  the 
LASSO,  may  seem  complicated,  but  when  the  predictors  are  standardized, 
||Aj||2  =  1  for  j  =  l,...p,  and  the  observations  are  independent  with  (say) 
unit  marginal  variance,  S  =  /,  then  V”*  is  equal  to  the  second  knot  A2  in  the 
lasso  path  and  is  equal  to  00.  Therefore  (6)  simplifies  to 


l-^(Ai) 
1  -  4>(A2) 


Unif(0,l). 


(7) 


This  statistic  measures  the  relative  sizes  of  Ai  and  A2,  with  values  of  Ai  A2 
being  evidence  against  the  null  hypothesis. 

Figure  1(a)  shows  the  empirical  distribution  function  of  a  sample  of  20,000  p- 
values  (6),  constructed  from  lasso  problems  with  a  variety  of  different  predictor 
matrices,  all  under  the  global  null  model  j5o  =  0.  In  particular,  for  each  sample, 
we  drew  the  predictor  matrix  X  uniformly  at  random  from  the  following  cases: 

•  small  case:  A  is  3  x  2,  with  values  (in  row-major  order)  equal  to  1, 2, . . .  6; 

•  fat  case:  X  is  100  x  10, 000,  with  columns  drawn  from  the  compound  sym¬ 
metric  Gaussian  distribution  having  correlation  0.5; 

•  tall  case:  X  is  10, 000  X  100,  with  columns  drawn  from  the  compound 
symmetric  Gaussian  distribution  having  correlation  0.5; 

•  lower  triangular  case:  X  is  500  x  500,  a  lower  triangular  matrix  of  Is  [the 
lasso  problem  here  is  effectively  a  reparametrization  of  the  1-dimensional 
fused  lasso  problem  (Tibshirani  et  al.  2005)]; 

•  diabetes  data  case:  X  is  442  x  10,  the  diabetes  data  set  studied  in  Efron 
et  al.  (2004). 
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(a)  Kac-Rice  test  (b)  Covariance  test 

Fig  1.  The  left  panel  shows  the  empirical  distribution  function  of  a  sample  of  20,000 
p-values  (6)  coming  from  a  variety  of  different  lasso  setups.  The  agreement  with  uni¬ 
form  here  is  excellent.  The  right  panel  shows  the  empirical  distribution  function  of  a 
sample  of  10,000  covariance  test  p-values,  computed  using  an  Exp(l)  approximation, 
using  three  different  lasso  setups.  The  Exp(l)  approximation  is  generally  conservative 
whereas  the  Kac-Rice  test  is  exact. 


As  is  seen  in  the  plot,  the  agreement  with  uniform  is  very  strong. 

In  their  proposed  covariance  test,  Lockhart  et  al.  (2013)  show  that  under  the 
global  null  hypothesis  Hq  :  f3o  =  0, 

Ai(Ai  —  A2) Exp(l)  as  n,p— 7^00,  (8) 

assuming  standardized  predictors,  ||  Aij||2  =  1  for  j  =  1, . .  .p,  independent  errors 
in  (4)  with  unit  marginal  variance,  E  =  /,  and  a  condition  to  ensure  that  A2 
diverges  to  00  at  a  sufficient  rate. 

In  finite  samples,  using  Exp(I)  as  an  approximation  to  the  distribution  of 
the  covariance  test  statistic  seems  generally  conservative,  especially  for  smaller 
values  of  n  and  p.  Eigure  1(b)  shows  the  empirical  distribution  function  from 
10,000  covariance  test  p-values,  in  three  of  the  above  scenarios.  [The  predictors 
were  standardized  before  applying  the  covariance  test,  in  all  three  cases;  this  is 
not  necessary,  as  the  covariance  test  can  be  adapted  to  the  more  general  case 
of  unstandardized  predictors,  but  was  done  for  simplicity,  to  match  the  form 
of  the  test  as  written  in  (8).]  Even  though  the  idea  behind  the  covariance  test 
can  be  conceivably  extended  to  other  regularized  regression  problems  (outside 
of  the  lasso  setting),  the  Exp(l)  approximation  to  its  distribution  is  generally 
inappropriate,  as  we  will  see  in  later  examples.  Our  test,  however,  naturally 
extends  to  general  regularization  settings,  allowing  us  to  attack  problems  with 
more  complex  penalties  such  as  the  group  lasso  and  nuclear  norm  penalties. 

The  rest  of  this  paper  is  organized  as  follows.  In  Section  2,  we  describe  the 
general  framework  for  regularized  regression  problems  that  we  consider,  and  a 
corresponding  global  null  hypothesis  of  interest;  we  also  state  our  main  result. 
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Theorem  1,  which  gives  an  exact  p-value  for  this  null  hypothesis.  The  next  two 
sections  are  then  dedicated  to  proving  Theorem  1.  Section  3  characterizes  the 
global  maximizer  (2)  in  terms  of  the  related  Gaussian  process  and  its  gradient. 
Section  4  applies  the  Kac-Rice  formula  to  derive  the  joint  distribution  of  the 
maximum  value  of  the  process  (1)  and  its  maximizer  (2),  which  is  ultimately 
used  to  derive  the  (uniform)  distribution  of  our  proposed  test.  In  Section  5  we 
broadly  consider  practicalities  associated  with  constructing  our  test  statistic, 
revisit  the  lasso  problem,  and  examine  the  group  lasso,  principal  components, 
and  matrix  completion  problems  as  well.  Section  6  discusses  the  details  of  the 
computation  of  the  quantities  V~,  and  needed  for  our  main  result.  We  em¬ 
pirically  investigate  the  null  distribution  of  our  test  statistic  under  non-Gaussian 
errors  in  Section  7,  and  end  with  a  short  discussion  in  Section  8. 

2.  General  regularized  regression  problems 

We  examine  a  class  of  regularized  least  squares  problems  of  the  form 

/3g  argmin  i||y-X/3||^-hA-T’(/3),  (9) 

/3eKp  ^ 

with  outcome  y  G  M",  predictor  matrix  X  G  and  regularization  parameter 

A  >  0.  We  assume  that  the  penalty  function  V  satisfies 

V{P)  =  max  /3,  (10) 

u^C 

where  G  C  Rp  is  a  convex  set,  i.e.,  V  is  the  support  function  of  C.  This  serves  as 
a  very  general  penalty,  as  we  can  represent  any  seminorm  (and  hence  any  norm) 
in  this  form  with  the  proper  choice  of  set  C.  In  this  work,  we  will  use  the  abuse 
of  notation  of  calling  (10)  a  semi-norm  (that  is,  we  do  not  require  symmetry  of 
semi-norms).  We  note  that  the  solution  /3  =  /3\  above  is  not  necessarily  unique 
(depending  on  the  matrix  X  and  set  C)  and  the  element  notation  used  in  (9) 
reflects  this.  A  standard  calculation,  which  we  give  in  Appendix  A.l,  shows  that 
the  fitted  value  A/3  is  always  unique,  and  hence  so  is  V{$)- 
Now  define 

Al  =  min{A  >  0  :  V0\)  =  0}. 

This  is  the  smallest  value  of  A  for  which  the  penalty  term  in  (9)  is  zero;  any 
smaller  value  of  the  tuning  parameter  returns  a  non-trivial  solution,  according 
to  the  penalty.  A  straightforward  calculation  involving  subgradients,  which  we 
give  in  Appendix  A. 2,  shows  that 

Ai  =  Q(A^(/-Pxc^)j/),  (11) 

where  Q  is  the  dual  seminorm  of  V,  i.e.,  Q(/3)  =  max^gC°  P,  the  support 
function  of  the  polar  set  C°  of  C.  This  set  can  be  defined  as  C°  =  {v  :  v^u  < 
1  for  all  u  G  G},  or  equivalently. 


C°  =  {v.  V{v)  <  1}, 
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the  unit  ball  in  V.  Furthermore,  in  (11),  we  use  Pxc^  to  denote  the  projection 
operator  onto  the  linear  subspace 

XC^  =  X{v  :  u  _L  C}  C  M”. 

We  recall  that  for  the  lasso  problem  (3),  the  penalty  function  is  V{f3)  =  ||/3||i, 
so  Q(/3)  =  ||/3||oo;  also  C  =  {u  :  ||m||oo  <  1},  which  means  that  Pxc^  =  0;  and 
hence  Ai  =  ||X^j/||oo,  as  claimed  in  Section  1.1. 

Having  just  given  an  example  of  a  seminorm  in  which  C  is  of  full  dimension 
p,  so  that  C'^  =  {0},  we  now  consider  one  in  which  C  has  dimension  less  than 
p,  so  that  C'^  is  nontrivial.  In  a  generalized  lasso  problem  (Tibshirani  &  Taylor 
2011),  the  penalty  is  =  ||T*/?||i  for  some  chosen  penalty  matrix  D  G 
In  this  case,  it  can  be  shown  that  the  dual  seminorm  is  Q(/3)  =  ll-^^lloo- 

Hence  C  =  {u  :  min^jT^^u  Halloo  <  I},  and  C-*-  =  null(Il),  the  null  space  of  D. 
In  many  interesting  cases,  this  null  space  is  nontrivial;  e.g.,  if  D  is  the  fused 
lasso  penalty  matrix,  then  its  null  space  is  spanned  by  the  vector  of  all  Is.  In 
fact,  the  usual  form  of  the  LASSO  Tibshirani  (1996)  is  a  seminorm: 

minmize  ^lly  -  (7I  +  XP)\\l  +  A||/?||i,  y  ~  N{X/3,  a^I). 

Of  course,  the  above  problem  can  be  solved  by  solving 

minimize  —  PXPW"^  +  A||/3||i,  z  =  Py  ~  N{X(3,  a^P),  P  =  I  —  —  ll"^. 

Z  71 

Both  of  these  problems  fit  into  the  framework  (9). 

2.1.  A  null  hypothesis 

As  in  the  lasso  case,  we  assume  that  y  is  generated  from  the  normal  model 

y-iV(X/?o,S),  (12) 

with  X  considered  fixed.  We  are  interested  in  the  distribution  of  (1)  in  order  to 
test  the  following  hypothesis: 


Ho  :  iP(/?o)  =  0.  (13) 

This  can  be  seen  a  global  null  hypothesis,  a  test  of  whether  the  true  underlying 
coefficient  vector  /3o  has  a  trivial  structure,  according  to  the  designated  penalty 
function  V. 

Assuming  that  the  set  C  contains  0  in  its  relative  interior,  we  have  V{/3)  = 
0  Pc (3  =  0,  where  Pc  denotes  the  projection  matrix  onto  span(C).  There¬ 

fore  we  can  rewrite  the  null  hypothesis  (13)  in  a  more  transparent  form,  as 


Ho  :  Pcl3o  =  0. 


(14) 
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Again,  using  the  lasso  problem  (3)  as  a  reference  point,  we  have  span(C')  = 
for  this  problem,  so  the  above  null  hypothesis  reduces  to  Hq  ■  (io  =  0,  as  in 
Section  1.1.  In  general,  the  null  hypothesis  (14)  tests  /3o  G  C-*-,  the  orthocom¬ 
plement  of  C. 

Recalling  that 

Al  =  max  v^X'^{I  -  Pxc^)y^ 

one  can  check  that,  under  Hq,  the  quantity  Ai  is  precisely  of  the  form  (1),  with 
IC  =  C°,e  =  X^il-Pxc^)y,  and  6  =  Cov(e)  =  X^ (I - P^c^Ml -  Pxc^)X 
[as  E(e)  =  A^(/  —  Pxc^)XPo  =  0  when  /3o  G  C-*-]. 


2.2.  Statement  of  main  result  and  outline  of  our  approach 


We  now  state  our  main  result. 

Theorem  1  (Main  result:  Kac-Rice  test).  Consider  the  general  regularized  re¬ 
gression  problem  in  (9),  with  V{j3)  =  max„gcM^/3  for  a  closed,  convex  set 
C  CMP  containing  0  in  its  relative  interior.  Denote  X  =  C°  =  {v  :  V{v)  <  1}, 
the  polar  set  of  C,  and  assume  that  X  can  he  stratified  into  pieces  of  different 
dimensions,  i.e., 

p 

/C  =  IJ  djX,  (15) 

where  OqX,  . . .  dpX  are  smooth  disjoint  manifolds  of  dimensions  0,. .  .p,  respec¬ 
tively.  Assume  also  assume  that  the  process 

fv  =  'Tx'^{I  -  Pxc^)y,  bG/C,  (16) 

is  Morse  for  almost  every  y  G  M".  Finally,  assume  that  y  G  M"  is  drawn  from 
the  normal  distribution  in  (12). 

Now,  consider  testing  the  null  hypothesis  Hq  :  P{l3o)  =  0  [equivalently,  Hq  : 
Pc  Pa  =  0,  since  we  have  assumed  that  0  G  relint(C')y.  Define  =  G~^Hn  for 
Gn,Hp  as  in  (29),  (30),  V“,V^  as  in  (24),  (23),  and  cr^  as  in  (32).  Finally,  let 
rj*  denote  the  almost  sure  unique  maximizer  of  the  process  fjj  over  X, 

y*  =  argmax  /^, 


and  let  Ai 
under  Hq, 


fri*  denote  the  first  knot  in  the  solution  path  of  problem  (9).  Then 


det(A„.  -I-  zl)(j),y2  {z)  dz 

■q* 


det(A„.  -I-  zl)(j),j2  [z)  dz 


Unif(0, 1), 


(17) 


where  4>,y2  denotes  the  density  function  of  a  normal  random  variable  with  mean 
0  and  variance  cr^. 


Taylor  et  al. /Inference  via  Kac-Rice 


The  quantity  (17)  is  the  Kac-Rice  pivot  evaluated  at  ^  =  0.  Lemma  5  shows 
it  is  a  pivotal  quantity  for  the  mean  p  near  ry*,  derived  via  the  Kac-Rice  for¬ 
mula.  Here  we  give  a  rough  explanation  of  the  result  in  (17),  and  the  approach 
we  take  to  prove  it  in  Sections  3  and  4.  The  next  section,  Section  2.3,  discusses 
the  assumptions  behind  Theorem  1;  in  summary,  the  assumption  that  1C  sepa¬ 
rates  as  in  (15)  allows  us  to  apply  the  Kac-Rice  formula  to  each  of  its  strata, 
and  the  Morse  assumption  on  the  process  frj  in  (16)  ensures  the  uniqueness  of 
its  maximizer  rj*.  These  are  very  weak  assumptions,  especially  considering  the 
strength  of  the  exact,  non-asymptotic  conclusion  in  (17). 

Our  general  approach  is  based  on  finding  an  implicit  formula  for  P(Ai  >  t) 
under  the  null  hypothesis  Hq,  where  Ai  is  the  first  knot  in  the  solution  path  of 
problem  (9)  and  can  be  written  as 

Al  =  max  /^, 
r)£K 

where  —  Pxc^)yj  the  process  in  (16).  Our  representation  for  the 

tail  probability  of  Ai  has  the  form 

P(Ai  >t)  =E(Q(lp, ,,))).  (18) 

Here  Q  =  Q,,*  is  a  random  distribution  function,  l(t,oo)  is  the  indicator  function 
for  the  interval  (t,  oo),  and  rj*  is  a  maximizer  of  the  process  fri-  This  maximizer 
ry*,  almost  surely  unique  by  the  Morse  assumption,  satisfies 

V*  cdQ{X^{I-Pxc^))y)  ^  1C, 

with  dQ  the  subdifferential  of  the  seminorm  Q.  Under  the  assumption  that 
1C  =  U^_Qi9y/C,  the  main  tool  we  invoke  is  the  Kac-Rice  formula  (Adler  &  Taylor 
2007),  which  essentially  enables  us  to  compute  the  expected  number  of  global 
maximizers  occuring  in  each  stratum  djIC. 

Remark  1.  Note  that  for  almost  every  realization,  under  the  Morse  assump¬ 
tion,  there  is  generically  only  one  maximizer  overall  and  hence  the  number  of 
them  is  either  0  or  1.  We  use  the  term  ’’number  of  global  maximizers”  when  ap¬ 
plying  the  Kac-Rice  formula  as  it  applies  to  counting  different  types  of  points.  In 
our  applications  of  it,  however,  there  is  only  ever  0  or  1  such  points.  Similar  ar¬ 
guments  were  used  to  establish  the  accuracy  of  the  expected  Euler  characteristic 
approximation  for  the  distribution  of  the  global  maximum  of  a  smooth  Gaussian 
process  in  Taylor  et  al.  (2005). 

This  leads  to  the  distribution  of  Ai,  in  Theorem  2,  as  well  as  the  represen¬ 
tation  in  (18),  with  Q  given  in  an  explicit  form.  Unfortunately,  computing  tail 
probabilities  P(Ai  >  t)  of  this  distribution  involve  evaluating  some  complicated 
integrals  over  1C  that  depend  on  X,  E,  and  hence  the  quantity  Ai  as  a  test  statis¬ 
tic  does  not  easily  lend  itself  to  the  computation  of  p-values.  We  therefore  turn 
to  the  survival  function  associated  with  the  measure  ,  and  our  main 
result  is  that,  when  carefully  evaluated,  this  (random)  survival  function  can  be 
used  to  derive  a  test  of  Hq,  as  expressed  in  (17)  in  Theorem  1  above. 
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2.3.  Discussion  of  assumptions 

In  terms  of  the  assumptions  of  Theorem  1,  we  require  that  C  contains  0  in  its 
relative  interior  so  that  we  can  write  the  null  hypothesis  in  the  equivalent  form 
Hq  :  PcPo  =  0,  which  makes  the  process  frj  in  (16)  have  mean  zero  under  Hq. 
We  additionally  assume  that  C  is  closed  in  order  to  guarantee  that  has  a 
well-defined  (finite)  maximum  over  rj  G  K.  =  C° .  See  Appendix  A. 3. 

Apart  from  these  rather  minor  assumptions  on  C,  the  main  requirements  of 
the  theorem  are:  the  polar  set  C°  =  JC  can  be  stratified  as  in  (15),  the  process 
fri  in  (16)  is  Morse,  and  y  follows  the  normal  distribution  in  (12).  Overall,  these 
are  quite  weak  assumptions.  The  first  assumption,  on  1C  separating  as  in  (15), 
eventually  permits  us  to  apply  to  the  Kac-Rice  formula  to  each  stratum  djIC.  We 
remark  that  many  convex  (and  non-convex)  sets  possess  such  a  decomposition; 
see  Adler  &  Taylor  (2007).  In  particular,  we  note  that  such  an  assumption  does 
not  limit  our  consideration  to  polyhedral  1C:  a  set  can  be  stratifiable  but  still 
have  a  boundary  with  curvature  (e.g.,  as  in  1C  for  the  group  lasso  and  nuclear 
norm  penalties). 

Further,  the  property  of  being  a  Morse  function  is  truly  generic;  again,  see 
Adler  &  Taylor  (2007)  for  a  discussion  of  Morse  functions  on  stratified  spaces. 
If  frj  is  Morse  for  almost  every  y,  then  its  maximizers  are  almost  surely  isolated, 
and  the  convexity  of  1C  then  implies  that  frj  has  an  almost  surely  unique  max¬ 
imizer  rj* .  From  the  form  of  our  particular  process  in  (16),  the  assumption 
that  fri  is  Morse  can  be  seen  as  a  restriction  on  the  predictor  matrix  X  (or  more 
generally,  how  X  interacts  with  the  set  C).  For  most  problems,  this  only  rules 
out  trivial  choices  of  X.  In  the  lasso  case,  for  example,  recall  that  =  {Xr])’^y 
and  1C  is  equal  to  the  unit  l\  ball,  so  f*  =  ||A^i/||oo,  and  the  Morse  property 
requires  \Xjy\,  j  =  1, ... p  to  he  unique  for  almost  every  y  G  M.^.  This  can  be 
ensured  by  taking  X  with  columns  in  general  position  [a  weak  condition  that 
also  ensures  uniqueness  of  the  lasso  solution;  see  Tibshirani  (2013)]. 

Lastly,  the  assumption  of  normally  distributed  errors  in  the  regression  model 
(12)  is  important  for  the  work  that  follows  in  Sections  3  and  4,  which  is  based 
on  Gaussian  process  theory.  Note  that  we  assume  a  known  covariance  matrix 
E,  but  we  allow  for  a  dependence  between  the  errors  (i.e.,  E  need  not  be  diago¬ 
nal).  Empirically,  the  (uniform)  distribution  of  our  test  statistic  under  the  null 
hypothesis  appears  to  quite  robust  against  non-normal  errors  in  many  cases  of 
interest;  we  present  such  simulation  results  in  Section  7. 

2.4.  Notation 

Rewrite  the  process  in  (16)  as 

fv  =  -  Pxc^)il  -  Pxc^)y  =  rjCJC, 

where  X  =  {I  —  Pxc^)N  and  y  =  {I  —  Pxc^)y-  The  distribution  of  y  is  hence 

y  ~  N{XPo,  E),  where  E  =  (/  —  Pxc^)^{I  —  Pxc^)-  Furthermore,  under  the 
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null  hypothesis  Hq  :  PcPo  =  0,  we  have  y  ~  -/V(0,  S).  Fo£  convenience,  in  Sec¬ 
tions  3  and  4,  we  will  drop  the  tilde  notation,  and  write  y,  X,  S  as  simply  y,  X,  S, 
respectively.  To  be  perfectly  explicit,  this  means  that  we  will  write  the  process 
fr,  in  (16)  as 

where  y  ~  N{X(3o,  S),  and  the  null  hypothesisjs  Hq  :  N{0,  S).  Notice  that 

when  span(C')  =  MP,  we  have  exactly  y  =  y,  X  =  X,  E  =  S,  since  Pxc^  =  0. 
However,  we  reiterate  that  replacing  y,  X,E  by  y,X,Y,  in  Sections  3  and  4  is 
done  purely  for  notational  convenience,  and  the  reader  should  bear  in  mind  that 
the  arguments  themselves  do  not  portray  any  loss  of  generality. 

We  will  write  Eq  to  emphasize  that  an  expectation  is  taken  under  the  null 
distribution  Hq  :  y  ^  N{0,  E). 

3.  Characterization  of  the  global  m2Lximizer 

Near  any  point  rj  G  1C,  the  set  K.  is  well- approximated  by  the  support  cone  S'^/C, 
which  is  defined  as  the  polar  cone  of  the  normal  cone  The  support  cone 

SrilC  contains  a  largest  linear  subspace — we  will  refer  to  this  T^/C,  the  tangent 
space  to  X  at  ry.  The  tangent  space  plays  an  important  role  in  what  follows. 

Remark  2.  Until  Section  6,  the  convexity  of  the  parameter  space  X  is  not 
necessary;  we  only  need  local  convexity  as  described  in  Adler  &  Taylor  (2007), 
i.e.,  we  only  need  to  assume  that  the  support  cone  of  X  is  locally  convex  ev¬ 
erywhere.  This  is  essentially  the  same  as  positive  reach  (Federer  1959).  To  he 
clear,  while  convexity  is  used  in  connecting  the  Kac-Rice  test  to  regularized  re¬ 
gression  problems  in  (11)  (i.e.  it  establishes  an  equality  between  left  and  right 
hand  sides),  the  right  hand  side  is  well-defined  even  if  the  set  C°  is  not  convex. 
That  is,  the  X  in  (1)  need  not  he  convex.  In  fact,  the  issue  of  convexity  is  only 
important  for  computational  purposes,  not  theoretical  purposes.  In  this  sense, 
this  work  provides  an  exact  conditional  test  based  on  the  global  maximizer  of  a 
smooth  Gaussian  field  on  a  fairly  arbitrary  set.  This  is  an  advance  in  the  theory 
of  smooth  Gaussian  fields  as  developed  in  Adler  &  Taylor  (2007)  and  will  be 
investigated  in  future  work. 

We  study  the  process  in  (16),  which  we  now  write  as  fri  =  ry^AT^y  over 
7]  G  X,  where  y  ~  N{X(3o,  S),  and  with  the  null  hypothesis  Ho  :  y  ^  N{0,  E) 
(see  our  notational  reduction  in  Section  2.4).  We  proceed  as  in  Chapter  14  of 
Adler  &  Taylor  (2007),  with  an  important  difference  being  that  here  the  process 
fjj  does  not  have  constant  variance.  Aside  from  the  statistical  implications  of 
this  work  that  have  to  do  with  hypothesis  testing,  another  goal  of  this  paper  is 
to  derive  analogues  of  the  results  in  Taylor  et  al.  (2005),  Adler  &  Taylor  (2007) 
for  Gaussian  processes  with  nonconstant  variance.  For  each  rj  G  X,  we  define  a 
modified  process 

/z  = /^  -  2^^a»7Ws(V/|T,/c),  zGX, 
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where  is  the  vector  that,  under  Hq  :  y  ~  -/V(0,  S),  computes  the 

expectation  of  fz  given  V  f\T^jc,  the  gradient  restricted  to  i.e., 

z'^an,xM^  f\Tr,K)  =  Eo(/z|V/|t,,/c)- 

To  check  that  such  a  representation  is  possible,  suppose  that  the  tangent  space 
Tr/fC  is  j-dimensional,  and  let  14)  G  be  a  matrix  whose  columns  form  an 

orthonormal  basis  for  T^/C.  Then  f\T^ic  =  and  a  simple  calculation 

using  the  properties  of  conditional  expectations  for  jointly  Gaussian  random 
variables  shows  that 

Mf.\'^f\T,K)  =  Z^X^Pv,X,12y, 

where 

p^,x.s  =  ^XV^{Vf  ,  (19) 

the  projection  onto  X14)  with  respect  to  E  (and  denoting  the  Moore-Penrose 
pseudoinverse  of  a  matrix  A).  Hence,  we  gather  that 

f\Tr,K)  =  X'^P^^x,Y.y, 

and  our  modified  process  has  the  form 

n  =  fz-  z'^X^P^^x,^y  =  (Xzfil  -  Prj,x,Ay-  (20) 

The  key  observation,  as  in  Taylor  et  al.  (2005)  and  Adler  &  Taylor  (2007),  is 
that  if  ?7  is  a  critical  point,  i.e.,  X f\Tr,K  =  0,  then 

=  fz  for  all  z  e  X.  (21) 

Similar  to  our  construction  of  Q;^_x,E(V/|7yK:))  we  dehne  Cx,Ay)  such  that 

P  z^x^{i-P,,x,Ani-Plx,AXy  ~ 

T^TxT{^I-P^x,Ani-Plx,^)Xy  " 

=  z^CxMv)T:I-  (22) 

and  after  making  three  subsequent  definitions. 


n  -  z^cx,Av)  ■  f:j 

zdK:  z^/{-q)<l  I  -  z'^Cxxi'n) 

,,+  .  n-z'^CxAy)-^ 

V;r  =  mm  - ^ — -- — , 

'  z(iK:zTCxx('n)>'^  Z^CxAm 

K  =  _  ■  h 

zGK:  Cx,s(^?)  =  1 


(23) 

(24) 

(25) 


we  are  ready  to  state  our  characterization  of  the  global  maximizer  y. 
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Lemma  1.  A  point  ij  G  1C  maximizes  over  a  convex  set  K.  if  and  only  if  the 
following  conditions  hold: 

V/|t,k  =  0,  /^>V-,  /^<V+,  and  V°  <  0.  (26) 

The  same  equivalence  holds  true  even  when  1C  is  only  locally  convex. 

Proof.  In  the  forward  djrection  (=^>),  note  that  V f\T  k  =  0  implies  that  we  can 
replace  by  (and  by  /^)  in  the  definitions  (24),  (23),  (25),  by  the  key 
observation  (21).  As  each  z  G  K.  is  covered  by  one  of  the  cases  Cx,s{v)  <  1) 
Cx.siv)  >  1)  Cx,T.{v)  =  Ij  we  conclude  that 

U  >  fz  for  all  z  GlC, 


i.e.,  the  point  77  is  a  global  maximizer. 

As  for  the  reverse  direction  («^=),  when  rj  is  the  global  maximizer  of  over 
1C,  the  first  condition  V/i-r  =  0  is  clearly  true  (provided  that  1C  is  convex  or 
locally  convex),  and  the  other  three  conditions  follow  from  simple  manipulations 
of  the  inequalities 

f-n  >  fz  for  all  z  G  1C. 


□ 


Remark  3.  The  above  lemma  does  not  assume  that  K.  decomposes  into  strata, 
or  that  frj  is  Morse  for  almost  all  w,  or  that  y  ~  Af(A/3o,S).  It  only  assumes 
that  K.  is  convex  or  locally  convex,  and  its  conclusion  is  completely  deterministic, 
depending  only  on  the  process  fr^  via  its  covariance  function  under  the  null,  i.e., 
via  the  terms  Pri,x,Y.  and  Cx,T,iv)- 

We  note  that,  under  the  assumption  that  fr^  is  Morse  over  K.  for  almost  every 
y  GM.p,  and  K.  is  convex.  Lemma  1  gives  necessary  and  sufficient  conditions  for 
a  point  rj  G  1C  to  be  the  almost  .sure  unique  global  maximizer.  Hence,  for  convex 
tC,  the  conditions  in  (26)  are  equivalent  to  the  usual  subgradient  conditions  for 
optimality,  which  may  be  written  as 


V/,  e  N^K  ^  VfiT.K  =  0,  yf\(T,jcp  e  iv^/c, 

where  N^jlC  is  the  normal  cone  to  IC  aty  and  'Cl f\(^Tr,K)^  ^^6  gradient  restricted 

to  the  orthogonal  complement  of  the  tangent  space  T^/C. 

Recalling  that  is  a  Gaussian  process,  a  helpful  independence  relationship 
unfolds. 

Lemma  2.  With  7/ ~  N{XPo,T,),  for  each  fixed  y  G  1C,  the  triplet  (V“,  V+,  V°) 
is  independent  of  ffj . 

Proof.  This  is  a  basic  property  of  conditional  expectation  for  jointly  Gaussian 
variables,  i.e.,  it  is  easily  verified  that  Cov(/^  —  z'^Cx.^iy),  ffj)  =  0  for  all  z.  □ 
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4.  Kac-Rice  formulae  for  the  global  maximizer  and  its  value 


The  characterization  of  the  global  maximizer  from  the  last  section,  along  with 
the  Kac-Rice  formula  (Adler  &  Taylor  2007),  allow  us  to  express  the  joint  dis¬ 
tribution  of 

if  =  argmax  and  =  max  /^. 

jje/C 

Theorem  2  (Joint  distribution  of  (?7*,  /»)•))•  Writing  JC  =  for  a  strat¬ 

ification  of  1C,  for  open  sets  A  C  OCR, 


P(r;*  eA,UGO)=^[  E  f  det(-VV|T,;c)  ' 

j—Q-ldjKnA  \ 

Vf\T,K  =  o)fv/^JO)n,idR),  (27) 


^{v^-</?<v+,vo<o,/^eO} 


where: 

•  fvf^T  K  density  of  the  gradient  in  some  basis  for  the  tangent  space 

TjjlC,  orthonormal  with  respect  to  the  standard  inner  product  on  T^/C,  i.e., 
the  standard  Euclidean  Riemannian  metrie  on 

•  the  measure  Hj  is  the  Hausdorff  measure  induced  by  the  above  Riemannian 
metric  on  each  djIC; 

•  the  Hessian  f\Tr,K  evaluated  in  this  orthonormal  basis  and,  for  j  =  0, 
we  take  as  convention  the  determinant  of  a  0  x  0  matrix  to  be  1  [in  Adler 
&  Taylor  (2007),  this  was  denoted  by  V^fQ.x:,y,  to  emphasize  that  it  is 
the  Hessian  of  the  restriction  of  f  to  djIC]. 

Proof  This  is  the  Kac-Rice  formula,  or  the  “meta-theorem”  of  Chapter  10  of 
Adler  &  Taylor  (2007)  (see  also  Azai's  &  Wschebor  2008,  Brillinger  1972),  applied 
to  the  problem  of  counting  the  number  of  global  maximizers  in  some  set  A  C  Rp 
having  value  in  O  C  R.  That  is, 

P(?7*  G  A,  /^.  G  O)  =  E^#|7y  G  K-D  A  :  V/|7'^/c  =  0: 

V-  <f[l<  v+,  v°  <  0,  /,  G  o}) 

=  E(#{77G/CnA:  V/|t,k  =  0, 

v-</)(<v+,  V°<0, /((go}), 

where  the  second  equality  follows  from  (21).  Breaking  down  1C  into  its  separate 
strata,  and  then  using  the  Kac-Rice  formula,  we  obtain  the  result  in  (27).  □ 

Remark  4.  As  before,  the  conclusion  of  Theorem  2  does  not  actually  depend 
on  the  convexity  of  1C.  When  K.  is  only  locally  convex,  the  Kac-Rice  formula 
[i.e.,  the  right-hand  side  in  (27)/  counts  the  expected  total  number  of  global 
maximizers  of  frj  lying  in  some  set  A  C  Rp,  with  the  achieved  maximum  value 
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m  O  C  M.  For  convex  1C,  our  Morse  eondition  on  implies  an  almost  surely 
unique  maximizer,  and  henee  the  notation  P(?7*  G  A,  /^.  G  O)  on  the  left-hand 
side  of  (27)  makes  sense  as  written.  For  locally  convex  1C,  one  simply  needs  to 
interpret  the  left-hand  side  as 

P^ry*  G  A  for  some  f^^  =  fyj  fy  C  O^. 

Remark  5.  When  frj  has  constant  variance,  the  distribution  of  the  maximum 
value  frj*  can  be  approximated  extremely  well  by  the  expected  Euler  characteristic 
(Adler  &  Taylor  2007)  of  the  excursion  set  ff^{t,  oo)  n  1C, 


eiv,/c} 


This  approximation  is  exact  when  K.  is  convex  (Takemura  &  Kuriki  2002),  since 
the  Euler  characteristic  of  the  excursion  set  is  equal  to  the  indieator  that  it  is 
not  empty. 


4.1.  Decomposition  of  the  Hessian 

In  looking  at  the  formula  (27),  we  note  that  the  quantities  f)j,  V“,  V+,  V°  inside 
the  indicator  are  all  independent,  by  construction,  of  V/|T,,if  •  It  will  be  useful 
to  decompose  the  Hessian  term  similarly.  We  write 

-^^f\T,K  =  -H^  +  G^-f:(l  +  R^, 

where 


i?,  =  -Eo(vV|T,K|V/|T,;c),  (28) 

G,-/)(  =  -Eo(vV|t,;c|/^),  (29) 

Hy  =  -{V^f\T,K-Ry)-G,-fl  (30) 

At  a  critical  point  of  f\g-ic,  notice  that  =  0  (being  a  linear  function  of  the 
gradient  V /it,, kg  which  is  zero  at  such  a  critical  point).  Furthermore,  the  pair 
of  matrices  {Grjffj,  Hjj)  is  independent  of  V/|7yK:-  Hence,  we  can  rewrite  our  key 
formulae  for  the  distribution  of  the  maximizer  and  its  value. 

Lemma  3.  Eor  each  fixed  rj  G  1C,  we  have 

independently  0/ (V“,  V°,  77,,),  with 

=ry^X(/-P„,x.E)^/?o, 

-  P,,x.e)S(/  -  Plx,^)Xv, 


(31) 

(32) 
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and  (recall)  Pri,x,T:  =  X'^ ,  for  an  orthonormal  basis 

o/T^/C. 

Moreover,  the  formula  (27)  can  be  equivalenty  expressed  as: 
e  A,  /,.  e  O)  =  ^  /  E  f  det(-i7,  +  gJO)  ■ 

j^Q->djKr\A  \ 


^{v^</^<v;J',vo<o,/^gO}  )'*/'v/|t,^(0)  (33) 


=  E 


E 


j=o  •'  djK.e:A 


MA^,V,,AtT,o'2(lo)l{v"<V+,VO<0} 

V'v/|r^K(0)det(G^)'Hj(d77),  (34) 


where 


and  Arj  = 


^-{z-p,f  I2a-^ 

h(z)  det(A  +  zl) - -p= — 

V 


dz, 


(35) 


Remark  6.  Until  (34),  we  had  not  used  the  independence  of  ffj  and  V~ ,  V((,  V° 
(Lemma  2).  In  (34)  we  do  so,  by  first  integrating  over  ffj  (in  the  definition  of 
M),  and  then  over  V~  ,V((  ,V(^. 


4-2.  The  conditional  distribution 

The  Kac-Rice  formula  can  be  generalized  further.  For  a  possibly  random  func¬ 
tion  h,  with  h\Q.ic  continuous  for  each  j  =  0, ..  .p,  we  see  as  a  natural  extension 
from  (33), 


=E  [  ^(Kv)det{-H^  +  G,f(l)  ■ 

j—QJdjK  \ 


^{V,7</^<V+,VO<0} 


This  allows  us  to  form  a  conditional  distribution  function  of  sorts.  As  defined  in 
(35),  Ma  0-2  is  not  a  probability  measure,  but  it  can  be  normalized  to  yield 
one: 


QA.v.^i.CT^  {g) 


Ma.v.m, 0-2(1) 


(37) 
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Working  form  (36),  with  h{ri)  =  g{fn), 


■))  =  E 


E 


d,K 


[A,.V,.At„,o-2(ff)l{v-<V+,VO<0}  j 

V'v/|t„k:(0)  det{G^)Hjidr]) 
Ar,,V,,,AtT,CT2  (ff)MA^,V,,/i,,<T2(l)l{v~<V+,VO<0}^ 

V’v/|r„K:(0)det(G^)'Hj(d?7).  (38) 


This  brings  us  to  our  next  result. 


Lemma  4.  Formally,  the  measure  Qa,v, is  a  conditional  distribution  func¬ 
tion  of  frj*,  in  the  sense  that  on  each  stratum  djIC,  j  =  0, . .  .p,  we  have 


\v*  =  V^  =  A: 


1^)  —  Qa,V,ai,,(t2  (5)) 


(39) 


for  all  p  G  djJC. 

Proof.  By  expanding  the  right-hand  side  in  (38),  we  see 


E(5(/r,*))  —  E  /  f  QA,V./i,,o-2(5)MA,v,AtT,o'2(l)l{V-<V+,VO<0}  • 

j^QJdjJCJW3xR3 

'>PvfiT„K  (0)  det(G^)  Fa„,v„  {dA,  dV)  Hj  (dp) 

where  F^^yr,  denotes  the  joint  distribution  of  {A^,V^)  =  (A,,,  V“,  V+,  V°)  at  a 
fixed  value  of  p.  Consider  the  quantity 


Ma,V,m,.<t2(1)1{V-<V+,VO<0}V’V/,t„k: 


(0)  det(G^)  Ta,,v,  {dA,  dV)'Hj  {dp) . 


(40) 


We  claim  that  this  is  the  joint  density  (modulo  differential  terms)  of  77* ,  A^» ,  , 

when  p*  is  restricted  to  the  smooth  piece  djIC.  This  would  complete  the  proof. 
Hence  to  verify  the  claim,  we  the  apply  Kac-Rice  formula  with  open  sets  A  C 
djIC,  B  C  and  G  C  giving 


P(?7*  G  A,  A,,.  G  B,  G  G)  = 


^E^MA^,V„,/i,.(T2(l)l{v-<v+,vo<0}  ■ 

l{AeB.veC}^  '*/'v/|T^;c  (0)  det(G^)  Hj {dp) 
(1)1{V  -<V+,VO<0}  * 


A  JBxC 

V'v/|,r„;c(0)  det(G^)  FA„y„{dA,dV)  Hj{dp). 


Taking  A,B,C  to  be  open  balls  around  some  fixed  points  p,A,V,  respectively, 
and  sending  their  radii  to  zero,  we  see  that  the  joint  density  of  p* ,  A^. ,  V,,* ,  with 
p*  restricted  to  djIC,  is  exactly  as  in  (40),  as  desired.  □ 
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Remark  7.  The  analogous  result  also  holds  unconditionally,  i.e.,  it  is  clear  that 

Kgifv ))  =  e(Qa^.  (g)) , 

by  taking  an  expectation  on  both  sides  of  (39). 


4-3.  The  Kac-Rice  pivotal  quantity 


Suppose  that  we  are  interested  in  testing  the  null  hypothesis  Hq  :  y  ^  N{0,  S). 
We  might  look  at  the  observed  value  of  the  hrst  knot  Ai  =  ,  and  see  if  it  was 

larger  than  we  would  expect  under  Hq.  From  the  results  of  the  last  section, 

]P(/r,*  >  t)  =  (l(t,oo)))) 

and  so  the  most  natural  strategy  seems  to  be  to  plug  our  observed  value  of  the 
first  knot  into  the  above  formula.  This,  however,  requires  computing  the  above 
expectation,  i.e.,  the  integral  in  (38). 

In  this  section,  we  present  an  alternative  approach  that  is  effectively  a  con¬ 
ditional  test,  conditioning  on  the  observed  value  of  y* ,  as  well  as  and  V,,*. 
To  motivate  our  test,  it  helps  to  take  a  step  back  and  think  about  the  measure 
QA.v.Ai.o-^  defined  in  (37).  For  fixed  values  of  A,  V,/U,  cr^,  we  can  reexpress  this 
(nonrandom)  measure  as 

/OO 

g{t)  ■  qA,V,y,a(t)dt, 

-OO 

where  qh,v,y,,a  is  a  density  function  (supported  on  [V“,V’'"]).  In  other  words, 
Qa.v.m.ct^  (g)  computes  the  expectation  of  g  with  respect  to  a  density  qA,v,y,a^j 
so  we  can  write  QA,v,y,a-^(g)  =  IE(ff(W^))  where  IF  is  a  random  variable  whose 
density  is  qA,v,y,a^  ■  Now  consider  the  survival  function 

'^A,V,y,a^{t)  =  QA,V,At.o-2(l(t,oo))  =  IP(kF  >  t). 

A  classic  argument  shows  that  Sa.v./x.ct^  (IF)  ~  Unif(0, 1).  Why  is  this  useful? 
Well,  according  to  lemma  4  (or.  Remark  7  following  the  lemma),  the  hrst  knot 
Al  =  /»)•  almost  takes  the  role  of  IF  above,  except  that  there  is  a  further  level 
of  randomness  in  y* ,  and  A,,. ,  Vjj* .  That  is,  instead  of  the  expectation  of  g{fri* ) 
being  given  by  Qa,. .v,.  is  given  by  E(Qa„. .v,* (5))-  The 

key  intuition  is  that  the  random  variable 


^A,..  ,V,,.  ,0-2.  ifv*)  —  Qa,.  ,V,,.  ,0-2.  (!(/,,.  .oo)) 


(41) 


should  still  be  uniformly  distributed,  since  this  is  true  conditional  on  y* ,  A^<. ,  , 

and  unconditionally,  the  extra  level  of  randomness  in  y* ,  A^* ,  just  gets  “aver¬ 
aged  out”  and  does  not  change  the  distribution.  Our  next  lemma  formalizes  this 
intuition,  and  therefore  provides  a  test  for  Hq  based  on  the  (random)  survival 
function  in  (41). 
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Lemma  5.  [Kac-Rice  pivot]  The  survival  function  o/Qa  with  A  =  A^. , 

V  =  Vrj* ,  pi  =  pi-Q* ,  =  CT^. ,  and  evaluated  at  t  =  /^. ,  satisfies 

(42) 

Proof.  Fix  some  ft,  :  M  — )■  M.  A  standard  argument  shows  that  (fixing  A,  V,  /i,  ct^), 


QA.V,M.<T2(/ioSA.V,At,f72)  =  /  Ht)dt. 

Jo 


Now  we  compute,  applying  (38)  with  g  being  a  composition  of  functions, 


E(^h/ 


,Vn*  ,Wr,*  ,0" 


p  ^  / 

/  ®(*QA,,V,,,At,,,o'2fftoS'A  V  ct21Ma,,V,,aIt,,o'2(1) 


^{vy <v+,vo<o}  ^  ^v/|T,,>c (*^)  det{Grj)'Hj{dg) 


~  y~!  /  I  /  ^(^)'^^l^^>7.v,,M,,,o-2(l)l{vy<v+,v“<o} 

j—Q^OjK  \'-J0  / 

V'v/|,r^;c(0)  det(G^)'Hj(d?7) 

P  /• 

/i(i)  dt 

{V-<V+,VO<0})  • 

;c(0)  det(G^)'Hj(d?7) 


j=0'^9,K 


=  f  h{t)  dt. 

Jo 


□ 


Remark  8.  /n  particular,  Lemma  5  shows  that  under  Hq, 
V.v,*.o.a\(/.*)~Unif(0,l). 

This  proves  our  main  result,  Theorem  1,  noting  that  the  statistic  in  (17)  is  just 
§A  .  V  .  0  0-2  {fn- )  written  out  a  little  more  explicitly. 

Remark  9.  IFe  have  used  the  survival  function  of  fjj  conditional  on  g  being  the 
global  maximizer  as  well  the  additional  local  information  (A^,  V,,).  Conditioning 
on  this  extra  information  makes  the  test  very  simple  to  compute,  at  least  in 
the  LASSO,  group  LASSO  and  nuclear  norm  cases.  If  we  were  to  marginalize 
over  these  quantities,  we  would  have  a  more  powerful  test.  In  general,  it  seems 
difficult  to  analytically  marginalize  over  these  quantities,  but  perhaps  Monte 
Carlo  schemes  would  be  feasible.  Further,  Lemma  5  holds  for  any  pL,  implying 
that  this  marginalization  over  (A,,,V,,)  would  need  access  to  the  unknown  g. 
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Under  the  global  null,  Hq  :  /i  =  0,  this  is  not  too  much  of  an  issue,  though  it 
already  causes  a  problem  for  the  construction  of  selection  intervals  described  in 
Section  8.2. 

5.  Practicalities  and  examples 

Given  an  instance  of  the  regularized  regression  problem  in  (9),  we  seek  to  com¬ 
pute  the  test  statistic 


(43) 


i 


det(A„.  -I-  zl)(j),j2  {z)  dz 

V* 


and  compare  this  against  Unif(0, 1).  Recalling  that  this  leaves  us 

with  essentially  6  quantities  to  be  computed — Ai,  V^. ,  V“. ,  — and 

the  above  integral  to  be  calculated. 

If  we  know  the  dual  seminorm  Q  of  the  penalty  V  in  closed  form,  then  the 
first  knot  Ai  can  be  found  explicitly,  as  in  Ai  =  {I  —  Pxc-^)y)l  otherwise, 

it  can  be  found  numerically  by  solving  the  (convex)  optimization  problem 


Al  =  max77^X’^(J  —  Pxc^)y  subject  to  V{r])  <1. 


The  remaining  quantities,  ,  V~, ,  ,  i?,,* ,  ,  all  depend  on  rj*  and  on  the 

tangent  space  Again,  depending  on  Q,  the  maximizer  rj*  can  either  be 

found  in  closed  form,  or  numerically  by  solving  the  above  optimization  problem. 
Once  we  know  the  projection  operator  onto  the  tangent  space  there  is 

an  explicit  expression  for  cr^,,  recall  (32);  furthermore,  V~.  ,V^.  are  given  by 
two  more  tractable  (convex)  optimization  problems  (which  in  some  cases  admit 
closed  form  solutions),  see  Section  6. 

The  quantities  Grj- ,  are  different,  however;  even  once  we  know  ry*  and  the 
tangent  space  finding  G,,*,  involves  computing  the  Hessian 

which  requires  a  geometric  understanding  of  the  curvature  of  /  around  T^»/C. 
That  is,  Gri* ,  Hn-  cannot  be  calculated  numerically  (say,  via  an  optimization 
procedure,  as  with  Ai,  ),  and  demand  a  more  problem-specific,  mathe¬ 

matical  focus.  For  this  reason,  computation  of  G^* ,  can  end  up  being  an 
involved  process  (depending  on  the  problem).  In  the  examples  that  follow,  we 
do  not  give  derivation  details  for  the  Hessian  but  refer  the  reader  to 

Adler  &  Taylor  (2007)  for  the  appropriate  background  material. 

.JL  -  This  seemed  like  a  natural  spot  to  include  the  algorithm 

We  now  revisit  the  lasso  example,  and  then  consider  the  group  lasso  and 
nuclear  norm  penalties,  the  latter  yielding  applications  to  principal  components 
and  matrix  completion.  We  remark  that  in  the  lasso  and  group  lasso  cases,  the 
matrix  A^.  =  G~}Hri*  is  zero,  simplifying  the  computations.  In  contrast,  it  is 
nonzero  for  the  nuclear  norm  case. 
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Algorithm  1  Computing  the  Kac-Rice  pivot 

1:  Solve  for  Ai  and  r/*  {using  consistent  notation,  see  (16)  and  Section  2.4} 
2:  Form  an  orthonormal  basis  Vr^*  of  the  tangent  space  Tr^*K 
3:  Compute  the  projection  Prj*,x,'E  (19) 

4:  Evaluate  the  conditional  variance  and  Cx,T,{v*)  from  (32)  and  (22) 
5:  if  f^j'  has  zero  Hessian  then 
6:  Let  hr]*  =  0 

7:  else  if  f\T^^K  9  then 

8:  Let  hr]*  =  G/}Hr]*  from  (29)  and  (30) 

9:  end  if 

10:  Solve  the  optimization  problems  (23)  and  (24),  yielding  V~*,  Vt. 

11:  Evaluate  the  integrals  in  (43)  to  obtain  S  =  *  V  *  0  (-^i) 

12:  return  S 


Also,  it  is  important  to  point  out  that  in  all  three  problem  cases,  we  have 
span(C')  =  M.P,  so  the  notational  shortcut  that  we  applied  in  Sections  3  and  4 
has  no  effect  (see  Section  2.4),  and  we  can  use  the  formulae  from  these  sections 
as  written. 

5.1.  Example:  the  lasso  (revisited) 

For  the  lasso  problem  (3),  we  have  V{I3)  =  ||/3||i  and  C  =  {u  :  ||u||oo  <  1})  so 
Q(/3)  =  ||/3||oo  and  K.  =  C°  =  {v  :  ||u||i  <  1}.  Our  Morse  assumption  on  the 
process  =  rf" over  K,  (which  amounts  to  an  assumption  on  the  design 
matrix  X)  implies  that  there  is  a  unique  index  j*  such  that 

Al  =  iVj.yl  =  llV^ylloo  =  max  jfx'^y. 

Then  in  this  notation  rj*  =  sign(AlJ(  j/)  •  Cj*  (where  Cj*  is  the  j*th  standard  basis 
vector),  and  the  normal  cone  to  X  at  rj*  is 

=  {v  eW  :  sign(vj.)  =  sign{Xf,y),  \vj\  <  \vj*  \  for  all  j  ^  j*}. 

Because  this  is  a  full-dimensional  set,  the  tangent  space  to  X  at  y*  is  = 
{Nrj*X)^  =  {0}.  This  greatly  simplifies  our  survival  function  test  statistic  (41) 
since  all  matrices  in  consideration  here  are  0x0  and  therefore  have  determinant 
1,  giving 

^  -  ^(Ai/ct^*) 

The  lower  and  upper  limits  ,  V~.  are  easily  computed  by  solving  two  linear 
fractional  programs,  see  Section  6.  The  variance  cr^.  of  is  given  by  (32),  and 
again  simplifies  because  is  zero  dimensional,  becoming 

=  {r]*fX^EXr]*  =  XjXX^,. 

Plugging  in  this  value  gives  the  test  statistic  as  in  (6)  in  Section  1.1.  The  reader 
can  return  to  this  section  for  examples  and  discussion  in  the  lasso  case. 
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5.2.  Example:  the  group  lasso 

The  group  lasso  (Yuan  &  Lin  2006)  can  be  viewed  as  an  extension  of  the  lasso 
for  grouped  (rather  than  individual)  variable  selection.  Given  a  pre-dehned  col¬ 
lection  Q  of  groups,  with  Ug^gg  =  {1, . .  .p},  the  group  lasso  penalty  is  defined 
as 

G 

ViP)  =  Y,Wg\\Pgh, 

9=1 

where  /3g  €  denotes  the  subset  of  components  of  /?  €  corresponding  to 
g,  and  Wg  >  0  for  all  g  €  Q.  We  note  that 

C  =  :  \\ug\\2<Wg,  g&g}, 

so  the  dual  of  the  penalty  is 

Q(/3)  =  max  vj-^W/Zgh, 


V  :  ^Wglluglla  <  1 
geG 

Under  the  Morse  assumption  on  X'^y  over  JC  (again,  this  corresponds 

to  an  assumption  about  the  design  matrix  X),  there  is  a  unique  group  g*  such 
that 


/C  =  (17°  =  { 


Al  =  w  }\\Xj,y\\2  =  max  w  ^\\Xjy\\2  =  max  R^X'^y, 

where  we  write  Xg  G  to  denote  the  matrix  whose  columns  are  a  subset 

of  those  of  X,  corresponding  to  g.  Then  the  maximizer  g*  is  given  by 

I _ _  a  g  =  g* 

V*g  =  I  Wg\\Xjy\\2  ,  for  all  gGO, 

I  0  otherwise 

and  the  normal  cone  is  seen  to  be 

=  ju  e  :  Vg-  =  cXj.y,  \\vg\\2/wg  <  c\\Xj.y\\2/wg- 

for  all  g  g*,  c  >  o|. 

Hence  the  tangent  space  =  (iV,,./C)-*-  is 

TgtX  =  ju  €  :  Ug.Yj.y  =  0,  Ug  =  0  for  all  g  g*'^, 
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which  has  dimension  r*  —  1,  with  r*  =  rank(Xg*).  An  orthonormal  basis  for 
this  tangent  space  is  given  by  padding  an  orthonormal  basis  for  (span(Xj,t/))-*- 
with  zeros  appropriately.  From  this  we  can  compute  the  projection  operator 

and  the  variance  of  /^-  as 

2  _ _ f _ , 

P  -  wl4Xj.y\\l' 


^Xg.Xj,  (7  -  P,.^x,4^Xg.Xj,y. 


The  quantities  ,  V^.  can  be  readily  computed  by  solving  two  convex  pro¬ 
grams,  see  Section  6.  Finally,  we  have  =  0  in  the  group  lasso  case,  as  the 
special  form  of  curvature  matrix  of  a  sphere  implies  that 
in  (29).  This  makes  A,,.  =  G~}Hn*  =  0,  and  the  test  statistic  (43)  for  the  group 
lasso  problem  becomes 


/  ^<t>a\(.z)dz 


Jvx  ’’ 

V 


lP(Xr*  <  V,p/o-,)0  -  P(Xr-»  <  Ai/g,,.) 

P(Xr*  <  V+.  /(Tr,-)-  lP(Xr*  <  V^.  /(T^.  )  ’ 


(44) 


In  the  above,  Xr*  denotes  a  chi  distributed  random  variable  with  r*  degrees  of 
freedom,  and  the  equality  follows  from  the  fact  that  the  missing  multiplicative 
factor  in  the  Xv  density  [namely,  2^“’’  f^/r(r*/2)]  is  common  to  the  numerator 
and  denominator,  and  hence  cancels. 

Figure  2(a)  shows  the  empirical  distribution  function  of  a  sample  of  20,000 
p-values  from  problem  instances  sampled  randomly  from  a  variety  of  different 
group  lasso  setups  (all  under  the  global  null  model  /?o  =  0): 


•  small  case:  X  is  3  x  4,  a  fixed  matrix  slightly  perturbed  by  Gaussian  noise; 
there  are  2  groups  of  size  2,  one  with  weight  -\/2,  the  other  with  weight 

0.1; 

•  fat  case:  X  is  100  x  10,  000  with  features  drawn  from  the  compound  sym¬ 
metric  Gaussian  distribution  having  correlation  0.5;  here  are  1000  groups 
each  of  size  10,  each  having  weight  VlO; 

•  tall  case:  X  is  10, 000  X  100  with  features  drawn  from  the  compound  sym¬ 
metric  Gaussian  distribution  having  correlation  0.5;  there  are  1000  groups 
each  of  size  10,  each  having  weight  VlO; 

•  square  case:  X  is  100  x  100  with  features  drawn  from  the  compound  sym¬ 
metric  Gaussian  distribution  having  correlation  0.5;  there  are  10  groups 
each  of  size  10,  each  having  weight  VlO; 

•  diabetes  case  1:  X  is  442  x  10,  the  diabetes  data  set  from  Efron  et  al. 
(2004);  there  are  4  (arbitrarily  created)  groups:  one  of  size  4,  one  of  size 
2,  one  of  size  3  and  one  of  size  1,  with  varying  weights; 

•  diabetes  case  2:  X  is  442  x  10,  the  diabetes  data  set  from  Efron  et  al. 
(2004);  there  are  now  10  groups  of  size  1  with  i.i.d.  random  weights  drawn 
from  1  -I-  0.2  •  Unif(0, 1)  (generated  once  for  the  entire  simulation); 
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Fig  2.  The  left  panel  shows  the  empirical  distribution  function  of  a  sample  of  20,000 
p-values  (44)  computed  from  various  group  lasso  setups,  which  agrees  very  closely  with 
the  uniform  distribution.  The  right  panel  shows  the  empirical  distribution  functions  in 
three  different  group  lasso  setups,  each  over  10,000  samples,  when  p-values  are  instead 
computed  using  an  Exp(l)  approximation  for  the  covariance  test.  This  approximation 
ends  up  being  anti-conservative  whereas  the  Kac-Rice  test  is  exact. 


•  nested  case  1:  X  is  100  x  10,  with  two  nested  groups  (the  column  space 
for  one  group  of  size  2  is  contained  in  that  of  the  other  group  of  size  8) 
with  the  weights  favoring  inclusion  of  the  larger  group  first; 

•  nested  case  2:  X  is  100  x  10,  with  two  nested  groups  (the  column  space 
for  one  group  of  size  2  is  contained  in  that  of  the  other  group  of  size  8) 
with  the  weights  favoring  inclusion  of  the  smaller  group  first; 

•  nested  case  3:  X  is  100  x  12,  with  two  sets  of  two  nested  groups  (in  each 
set,  the  column  space  for  one  group  of  size  2  is  contained  in  that  of  the 
other  group  of  size  4)  with  the  weights  chosen  according  to  group  size; 

•  nested  case  4:  X  is  100  x  120,  with  twenty  sets  of  two  nested  groups  (in 
each  set,  the  column  space  for  one  group  of  size  2  is  contained  in  that  of 
the  other  group  of  size  4)  with  the  weights  chosen  according  to  group  size. 

As  we  can  see  from  the  plot,  the  p-values  are  extremely  close  to  uniform. 

In  comparison,  arguments  similar  to  those  given  in  Lockhart  et  al.  (2013)  for 
the  lasso  case  would  suggest  that  for  the  group  lasso,  under  the  null  hypothesis. 


Ai(Ai  -V^,)d  ,  . 

- - ^  Exp(l) 


as  n,p  ^  oo, 


under  some  conditions  (one  of  these  being  that  V~,  diverges  to  oo  fast  enough). 
Figure  2(b)  shows  the  empirical  distribution  function  of  10,000  samples  from 
three  of  the  above  scenarios,  demonstrating  that,  while  asymptotically  reason¬ 
able,  the  Exp(l)  approximation  for  the  covariance  test  in  the  group  lasso  case 
can  be  quite  anti-conservative  in  finite  samples. 
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5.3.  Nuclear  norm 

In  this  setting,  we  treat  the  coefficients  in  (9)  as  a  matrix,  instead  of  a  vector, 
denoted  by  B  We  consider  a  nuclear  norm  penalty  on  B, 

V{B)  =  \\B\U=tr{D), 

where  D  is  the  diagonal  matrix  of  singular  values  in  the  singular  value  decom¬ 
position  B  =  UDV"’".  Here  the  dual  seminorm  is 

Q{B)  =  ||H||op  =  max(D), 

the  operator  norm  (or  spectral  norm)  of  B,  i.e.,  its  maximum  singular  value. 
Therefore  we  have 


C  =  {A:||H||op<l}, 

IC  =  C°  =  {W  :\\W\\,<1}. 

Examples  of  problems  of  the  form  (9)  with  nuclear  norm  penalty  V{B)  =  ||H||* 
can  be  roughly  categorized  according  to  the  choice  of  linear  operator  X  =  X{B). 
For  example, 

•  principal  components  analysis:  if  X  :  — >•  is  the  identity  map, 

then  Al  is  the  largest  singular  value  of  j/  €  and  moreover,  V“.  is  the 

second  largest  singular  value  of  y; 

•  matrix  completion:  if  X  :  — >•  IR"^p  zeros  out  all  of  the  entries  of  its 

argument  outside  some  index  set  O  C  {1, . . .  n}  x  {1, . .  .p},  and  leaves  the 
entries  in  O  untouched,  then  problem  (9)  is  a  noisy  version  of  the  matrix 
completion  problem  (Candes  &  Recht  2009,  Mazumder  et  al.  2010). 

•  reduced  rank  regression:  if  X  :  IR"^^’  — )•  IR^^p  performs  matrix  multiplica¬ 
tion,  X(B)  =  XB,  and  y  G  IR"*^p,  then  problem  (9)  is  often  referred  to 
as  reduced  rank  regression  (Mukherjee  et  al.  2012). 

The  first  knot  in  the  solution  path  is  given  by  Ai  =  ||X^(j/)||op,  with  X^ 
denoting  the  adjoint  of  the  linear  operator  X.  Assuming  that  X'^{y)  has  singular 
value  decomposition  X'^{y)  =  UDV^  with  D  =  diag(di,c?2,  •  •  ■)  for  di  >  d2  > 
. . .,  and  that  the  process  =  {p,  X'^{y))  is  Morse  over  p  G  K.,  there  is  a  unique 
rj*  G  1C  achieving  the  value  Ai, 

V*  =  C/iVf , 

where  Ui,  Vi  are  the  first  columns  of  U,  V,  respectively.  The  normal  cone  NrfJC 
is 


=  jcHilf’  +  cUDV'^  :  U^U  =  1,  V'^V  =  I,  £>  =  diag(di, dz,  ■  ■  •), 

UlU  =  0,  V^V  =  0,  max(5)  <  1,  c  >  o|. 
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and  so  the  tangent  space  is 

T^.IC  =  span({c/iy/,  j  =  2, . .  .p}  U  {u,Vl,j  =  2, . . .  n}). 

From  this  tangent  space,  the  marginal  variance  in  (32)  can  be  easily  com¬ 
puted.  This  leaves  V“. ,  ,  G,,. ,  iJ,,*  to  be  addressed.  As  always,  the  quantities 

can  be  determined  numerically,  as  the  optimal  values  of  two  convex  pro¬ 
grams,  see  Section  6.  We  now  discuss  computation  of  the  {n+p  —  2)  x  {n+p—2) 
matrices  and  refer  the  reader  to  Candes  et  al.  (2012),  Takemura  & 

Kuriki  (2002,  1997)  for  some  of  the  calculations  that  follow.  The  entries  of  G,,- 
are  given  by 

G,.  (UiVl,  U,Vl)  =  G,.  {U,Vl,  UiVf) 

=  tr{V,U^CxMv*)), 

G^.  (UiVl,  UiVf)  =  (U.Vl,  U,Vl) 

=  Sij  ■  triViUl Gx,s (??*)), 

with  Cx,'siv*)  in  (22),  which  has  a  similar  computational  form  to  ,  and 
can  be  computed  from  knowledge  of  T^-IC  above.  The  entries  of  77^*  are  given 

by 

{UiVl,  UjVl)  =  Hr,.  {UrVi,  UiVf) 

=  tT{VrUfX^{y))  -  di  ■  G^.(G,Vf ,  UiVl) 

=  6rj  -dr-di-  G„.  (UjVl,  UiVl), 

Hr,.  {UiVl,  U,Vf)  =  Hr,.  {UrVi,  U,Vl) 

=  drj  [di  -  tT(ViU{ Gx.s(?7*))]  ■ 


When  expressed  in  a  suitable  ordering  of  the  above  basis  of  the  tangent  space, 
the  matrix  form  of  the  above  expressions  are,  abbreviating  G  =  Gx,^ip*), 

_  /GfGyi- /(„_!) x(„-i)  GliGWi  \ 

Vl.G^U.i  GfGFi- ’ 

17  _  ~  Gf’GVi) /(„_i)x(n-i)  D-i  —  diU'E,^CV-i  \ 

diVl.G^U.i  (di  -  U^GVi)  /(p_i)x(p-i)  j  ’ 

where  U-i  denotes  all  but  the  first  column  of  U,  with  V-i  similarly  defined, 
and  D-i  denotes  the  matrix  diag(d2)d3, . . .)  with  zeros  added  below  in  such  a 
way  that  Il_i  is  an  (n  —  1)  x  (p  —  1)  matrix.  That  is,  if  r  =  rank(A^(y)), 


^  ^  /diag(d2,---dr)\ 

\  b(n— r)x(p— 1)  / 

In  the  special  case  of  principal  components  analysis,  in  which  X  is  the  identity 
map  on  the  above  expressions  simplify  to  Gr,.  =  I(ri+p-2)y.{n+p-2)  and 


Hr,.  = 
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Fig  3.  The  left  panel  shows  the  empirical  distribution  function  of  a  sample  of  20,000  p- 
values  computed  over  a  variety  of  problem  setups  that  utilize  the  nuclear  norm  penalty. 
The  right  panel  shows  the  distribution  of  20,000  covariance  test  p-values  when  an 
Exp(l)  approximation  is  used,  which  shows  the  exponential  approximation  to  be  clearly 
inappropriate  for  the  nuclear  norm  setting. 


and  the  {n+p  —  2)  eigenvalues  of  A^*  =  are  seen  to  be 

{±dj,j  =  2,...r}U  [0]  •  {n+p-2r). 

In  Figure  3,  we  plot  the  empirical  distribution  function  of  a  sample  of  20,000 
p-values  computed  over  problem  instances  that  have  been  randomly  sampled 
from  the  following  scenarios,  all  employing  the  nuclear  norm  penalty  (and  all 
under  the  null  model  Bq  =  0,  with  Bq  being  the  underlying  coefficient  matrix): 

•  principal  components  analysis:  y  is  2  x  2, 3  x  4,  50  x  50,100  x  20,30  x 
1000,30  X  5,1000  X  1000; 

•  matrix  completion:  y  is  10  x  5  with  50%  of  its  entries  observed  at  random, 
100  X  30  with  20%  of  its  entries  observed  at  random,  10  x  5  with  a  non- 
random  pattern  of  observed  entries,  20  x  10  with  a  nonrandom  pattern  of 
observed  entries,  200  x  10  with  10%  of  its  entries  observed  at  random; 

•  reduced  rank  regression:  X  is  100  x  10  whose  entries  are  drawn  from  the 
compound  symmetric  Gaussian  distribution  with  correlation  0.5,  and  y  is 
100  X  5. 

As  is  evident  in  the  plot,  the  agreement  with  uniform  is  excellent  (as  above, 
each  particular  scenario  above  produces  Unif(0, 1)  p-values  regardless  of  how  X 
was  chosen,  modulo  the  Morse  assumption). 

Again,  along  the  lines  of  the  covariance  test,  we  consider  approximation  of 
Ai(Ai  —  V“.  )/(T^.  by  an  Exp(l)  distribution  under  the  null  hypothesis.  Figure 
3(b)  shows  that  this  approximation  is  quite  far  off,  certainly  much  more  so  than 
in  the  other  examples.  Preliminary  calculations  confirm  mathematically  that  the 
Exp(l)  distribution  is  not  the  right  limiting  distribution  here;  we  will  pursue 
this  in  future  work. 
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6.  Finding  ,  V+ 


Earlier,  in  Remark  we  noted  that  convexity  of  1C  was  not  needed  for  much  of 
the  work  in  this  paper.  It  is  only  when  we  wish  to  actually  compute  the  test 
statistic  that  we  restrict  to  convex  1C.  This  section  describes  convex  optimization 
problems  that  can  be  solved  to  find  the  values  of  which  are  one  of  the  key 
stumbling  pieces  to  computing  the  test  statistic.  Nevertheless,  if  we  had  access 
to  a  computer  that  could  compute  these  quantities,  along  with  A,,  we  would 
not  need  to  restrict  interest  to  convex  JC.  We  are  still  assuming  poshive  reach 
as  this  is  an  important  assumption  to  ensure  the  modified  processes  are  not 
singular.  See  Chapter  14  of  Adler  &  Taylor  (2007)  for  the  case  when  /C  is  a 
smooth  locally  convex  (i.e.  positive  reach)  subset  of  the  Euclidean  sphere.  This 
paper  considers  arbitrary  smooth  subsets  of  positive  reach  up  to  this  point. 

Inspecting  Lemma  5,  we  see  that  in  order  to  compute  the  p-value  in  (17),  we 
must  find  V“,  V+  at  rj  =  ij* .  Recall  the  definition  of  V~  in  (24), 

Is  -  z^CxAv)  ■  f^j 

zeK:z^/{v)<l  ^  -  z'^Cx,T,iv) 


where  Cx.siv)  is  defined  as  in  (22),  and  can  be  expressed  more  concisely  as 
Z  CxMn) 

Recall  that  at  77  =  77*  we  have  /))  =  fz  for  all  2;,  and  =  Ai,  so 

11  -  z^CxMvl  ■  If  =  z^{x^y  -  CxMv*)  ■  Al). 


Therefore,  is  given  by  the  maximization  problem 

z'^{X'^y  -  CxM'n*)^!) 

^  zeRp  l-z^CxM'>l) 

subject  to  z  G  1C,  1  —  z^Cx,s{v*)  >  0.  (45) 

This  is  a  generalized  linear-fractional  problem  (Boyd  &  Vandenberghe  2004). 
(We  say  “generalized”  here  because  the  constraint  z  G  JC  ^’{z)  <  1  can 
be  seen  as  an  infinite  number  of  linear  constraints,  while  the  standard  linear- 
fractional  setup  features  a  finite  number  of  linear  constraints.)  Though  not  con¬ 
vex,  problem  (45)  is  quasilinear  (i.e.,  both  quasiconvex  and  quasiconcave),  and 
further,  it  is  equivalent  to  a  convex  problem.  Specifically, 

V~.  =  max  {X'^y  -  Cx,Y.{r]*)\i) 

subject  to  Viu)  <w,w  —  u^Cx.^iv*)  =  1,  w  >0.  (46) 

The  proof  of  equivalence  between  (45)  and  (46)  follows  closely  Section  4.3.2  of 
Boyd  &  Vandenberghe  (2004),  and  so  we  omit  it  here.  Similarly,  V^.  is  given  by 
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the  convex  minimization  problem 

V±  =  min  {X'^y  -  Cx,T,{-n*)\i) 

subject  to  V{u)  <w,w  —  Cx,y.{r*)  =  —1,  w  >  0.  (47) 

In  the  worst  case,  V“*  and  can  be  determined  numerically  by  solving  the 
problems  (46)  and  (47).  Depending  on  the  penalty  V ,  one  may  favor  a  particular 
convex  optimization  routine  over  another  for  this  task,  but  a  general  purpose 
solver  like  ADMM  (Boyd  et  al.  2011)  should  be  suitable  for  a  wide  variety  of 
penalties.  In  several  special  cases  involving  the  lasso,  group  lasso,  and  nuclear 
norm  penalties,  V~,  and  V^.  have  closed-form  expressions.  We  state  these  next, 
but  in  the  interest  of  space,  withhold  derivation  details.  We  leave  more  precise 
calculations  to  future  work. 


6.1.  Lasso  and  group  lasso 


The  lasso  problem  is  a  special  case  of  the  group  lasso  where  all  groups  have  size 
one;  therefore  the  formulae  derived  here  for  the  group  lasso  also  apply  to  the 
lasso.  (In  the  lasso  case,  actually,  the  angles  below  are  always  ±7r.) 

For  any  group  g  G  G,  let  0{rj,g)  be  the  angle  between  X'^y  —  Cx,T:{v)g  and 
Cx,s{v)g  [where  recall  that  Cx^siv)  is  as  defined  in  (22)],  and  define  the  angles 
by 

■  l±/  \  \\G^X,T.{'n)g\\2  .  nf  , 

Sin  ^=^(77,3)  =  - ■  sm  61(77, 5). 

Wg 

Define  the  quantities 


and 


w^{il,g) 


\\Xjy  -  Cx, £(77)3112  •  cos 'ilj^{r],g) 

Wg  -  ||C'x.s(7?)3l|2  •  cos  (0(77,5)  -  ’ 


7;+(77,5) 

v~{ri,9) 


mm{w^{T],g)} 

^max{w^{T],g)} 

max{ 777=^(77,3)} 


if  ||C'x.e(7?)3||2  >  Wg 
otherwise 

if  ||C'x.s(77)3l|2  >  Wg 
otherwise 


Then  have  the  explicitly  computable  form 

V~.  =  max  v+{g*,g), 
Vi  =  min  v~{r]*,g). 

9#9* 
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(a)  Group  lasso  (b)  Nuclear  norm 


Fig  4.  Comparison  of  the  explicit  (analytically  derived)  form  ofV/,  versus  the  solution 
of  convex  program  (46)  found  numerically  by  ADMM,  for  the  group  lasso  and  nuclear 
norm  penalties. 


6.2.  Nuclear  norm 

For  principal  components  analysis,  when  X  =  I,  it  is  not  difficult  to  show  that 
V~,  =  6,2  =  —  min(Aj,»)  and  =  oo.  Numerically,  this  seems  to  be  true  even 
for  an  arbitrary  X  (reduced  rank  regression)  or  arbitrary  patterns  of  missingness 
(matrix  completion).  We  remark  that  computing  V~.  =  6,2  requires  solving  an 
eigenvalue  problem  of  size  at  least  max(n,  p)  x  max(n,  p) .  (The  ADMM  approach 
for  solving  problem  (46)  is  no  better,  as  each  iteration  involves  projecting  onto 
the  nuclear  norm  epigraph,  which  requires  an  eigendecomposition.)  This  is  quite 
an  expensive  computation,  and  a  fast  approximation  is  desirable.  We  leave  this 
for  future  work. 

6.3.  Relation  to  second  knot  in  the  solution  path 

The  quantity  V~.  is  always  a  lower  bound  on  the  first  knot  Ai,  and  possesses  an 
interesting  connection  to  another  tuning  parameter  value  of  interest  along  the 
solution  path.  In  particular,  is  related  to  the  second  knot  A2,  which  can  be 
interpreted  more  precisely  in  a  problem-specific  context,  as  follows: 

•  for  the  lasso,  with  standardized  predictors,  |i-AjH2  for  all  j  =  1, . .  .p,  we 
have  exactly  V~»  =  A2,  the  value  of  the  tuning  parameter  A  at  which  the 
solution  changes  from  one  nonzero  component  to  two  [this  follows  from  a 
calculation  as  in  Section  4.1  of  Lockhart  et  al.  (2013)]; 

•  for  the  group  lasso,  as  shown  in  Figure  5(a),  V~.  is  numerically  very  close 
to  the  value  A2  of  the  parameter  at  which  the  solution  changes  from  one 
nonzero  group  of  components  to  two; 
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(a)  Group  lasso  (b)  Nuclear  norm 


Fig  5.  Comparison  ofV/,  with  the  second  knot  A2  in  the  solution  path,  for  the  group 
lasso  and  the  nuclear  norm  problems.  For  the  group  lasso,  X2  is  defined  to  be  the  largest 
value  of  A  such  that  only  one  group  of  variables  is  active.  For  the  nuclear  norm,  it  is 
defined  as  the  largest  value  of  \  for  which  the  solution  has  rank  1. 


•  for  the  nuclear  norm  penalty,  as  shown  in  Figure  5(b),  V~.  is  numerically 
very  close  to  the  value  A2  of  the  parameter  at  which  the  solution  changes 
from  rank  1  to  rank  2. 


7.  Non-Gaussian  errors 

Throughout,  our  calculations  have  rather  explicitly  used  the  fact  that  X^y  is 
Gaussian  distributed.  One  could  potentially  appeal  to  the  central  limit  theorem 
if  the  components  oiy  —  XfiQ  are  i.i.d.  from  some  error  distribution  (treating  X 
as  hxed),  though  the  calculations  in  this  work  focus  on  the  extreme  values,  so 
the  accuracy  of  the  central  limit  theorem  in  the  tails  may  be  in  doubt.  In  the 
interest  of  space,  we  do  not  address  this  issue  here. 

Instead,  we  consider  a  scenario  with  heavier  tailed,  skewed  noise  and  present 
simulation  results.  In  particular,  we  drew  errors  according  to  a  t-distribution 
with  5  degrees  of  freedom  plus  an  independent  centered  Exp(l).  Figure  7  shows 
that  the  lasso  and  group  lasso  p-values  are  relatively  well-behaved,  while  the 
nuclear  norm  p-values  seem  to  break  down. 

8.  Discussion 

We  derived  an  exact  (non-asymptotic)  p-value  for  testing  a  global  null  hypoth¬ 
esis  in  a  general  regularized  regression  setting.  Our  test  is  based  on  a  geometric 
characterization  of  regularized  regression  estimators  and  the  Kac-Rice  formula, 
and  has  a  close  relationship  to  the  covariance  test  for  the  lasso  problem  (Lock¬ 
hart  et  al.  2013).  In  fact,  the  Exp(l)  limiting  null  distribution  of  the  covariance 
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Fig  6.  Empirical  distribution  function  of  p-values  for  the  lasso,  group  lasso  and  matrix 
completion  problems  under  heavy-tailed,  skewed  noise. 
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test  can  be  derived  from  the  formulae  given  here.  These  two  tests  give  similar 
results  for  lasso  problems,  but  our  new  test  has  exact  (not  asymptotic)  error 
control  under  the  null,  with  less  assumptions  on  the  predictor  matrix  X. 

Another  strength  of  our  approach  is  that  it  provably  extends  well  beyond 
the  lasso  problem;  in  this  paper  we  examine  tests  for  both  the  group  lasso  and 
nuclear  norm  regularization  problems.  Still,  the  test  can  be  applied  outside  of 
these  settings  too,  and  is  limited  only  by  the  difficulty  in  evaluating  the  p- value 
in  practice  (which  relies  on  geometric  quantities  to  be  computed). 

We  recall  that  the  covariance  test  for  the  lasso  can  be  applied  at  any  knot 
along  the  solution  path,  to  test  if  the  coefficients  of  the  predictors  not  yet  in  the 
current  model  are  all  zero.  In  other  words,  it  can  be  used  to  test  more  refined 
null  hypotheses,  not  only  the  global  null.  We  leave  the  extension  of  Theorem  1 
to  this  more  general  testing  problem  for  future  work. 

8.1.  Power 

One  important  issue  we  have  not  yet  addressed  is  the  power  of  our  tests.  While 
the  setting  under  which  our  tests  are  applicable  is  quite  broad  (in  the  lasso  prob¬ 
lem,  the  only  assumption  needed  is  that  X  have  columns  in  general  position), 
an  end  user  will  likely  be  interested  in  how  powerful  our  test  is. 


8.1.1.  Lasso 


For  the  lasso  case,  our  test  statistic  is  based  on  a  conditional  distribution  of 
||Ar^y||oo  [referred  to  as  the  Max  test  in  Arias-Castro  et  al.  (2011)].  Therefore, 
the  best  one  can  expect  is  to  have  similar  results  to  the  Max  test  in  this  situation. 
For  general  X,  a  simple  sufficient  condition  for  full  asymptotic  power  is  that  the 
gap  Al  —  A2  must  decrease  to  0  no  faster  than  Ai  — >  00.  This  follows  from  the 
Mills’  ratio  approximation 


A2)(Ai  -I-  A2) 


We  note  here  that,  in  principle,  by  taking  sequences  of  problems  (Ar„,  y„)  the 
set  of  limiting  distributions  for  jlAjj/njloo  under  Hq  is  effectively  the  set  of  all 
possible  distributions  for  the  maximum  absolute  value  of  a  (separable)  centered 
Gaussian  process.  This  is  a  huge  class,  implying  that  the  study  of  power  for 
such  a  problem  is  also  a  very  broad  problem.  The  Kac-Rice  test  is  based  on  a 
conditional  distribution  for  the  Max  test  statistic  and  is  applicable  for  almost 
all  matrices  X. 

The  behavior  of  the  Max  test  has  been  established  for  low  coherence  designs 
in  Arias-Castro  et  al.  (2011),  though  we  emphasize  that  our  test  makes  no 
assumption  about  coherence.  In  the  interest  of  space,  we  consider  the  power  of 
our  test  to  the  Max  test  in  the  orthonormal  design  case.  While  this  is  a  stronger 
assumption  than  low  coherence,  we  expect  a  similar  situation  to  hold  in  the 
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low  coherence  setting  discussed  in  .  Attempting  to  prove  that  these  orthogonal 
results  carry  over  to  a  low  coherence  scenario  is  an  interesting  problem  for  future 
research  and  is  beyond  the  scope  of  this  paper. 

In  the  orthonormal  case,  our  test  statistic  reduces  in  distribution  to 

!-$(%))’ 

where  Zi  ~  |A^(/Xi,l)|  independently  for  i  and  Z(i)  >  Z(2)  >  > 

are  their  order  statistics  in  decreasing  order. 

We  first  consider  the  case  of  finite  sparsity  and  identity  design.  The  simplest 
possible  alternative  hypothesis  is  the  1-sparse  case  where  a  single  Ri  is  nonzero. 
Let  =  ryj2  log(p)  and  all  other  =  0.  If  r  is  some  constant  greater  than  I, 
then  with  high  probability  the  first  knot  Ai  will  be  achieved  by  Z\.  Standard 
Gaussian  tail  bounds  can  be  used  to  upper  bound  the  Kac-Rice  pivot  and  we 
see  that  for  any  r  >  1  the  upper  bound  goes  to  0  as  p  — )■  oo,  so  in  this  case  the 
test  has  asymptotic  full  power  at  the  same  threshold  as  Bonferroni.  This  is  the 
best  possible  threshold  for  asymptotic  power  against  the  1-sparse  alternative. 
Similar  statements  hold  for  fc-sparse  alternatives  where  k  is  considered  fixed. 

We  now  consider  the  sparsity  growing  with  n.  Following  Arias-Castro  et  al. 
(2011),  we  set  of  the  underlying  means  Ri,  i  =  l,...n  to  be  nonzero 

and  equal  to  a.  This  is  arguably  the  worst  case  for  our  test,  which  is  based  on 
the  gap  between  and  Z(2).  Let  A{5,  n)  =  ^/2  logn  •  (1  —  -y/l  —  <5);  this  is  the 
threshold  for  non-trivial  asymptotic  for  the  Max  test.  That  is,  for  any  e  >  0, 
the  Max  test  is  asymptotically  power  powerful  (Type  I  -I-  Type  II  error  — >  0) 
if  the  nonzero  means  have  value  a  =  A{d,  n)  -\-  eV21ogn,  and  asymptotically 
powerless  (Type  I  -I-  Type  II  error  — )■  1  or  more)  if  the  nonzero  means  have 
value  a  =  A{5,  n)  —  e^2  logn. 

A  fairly  straightforward  argument  based  on  the  spacings  calculations  in  Lock¬ 
hart  et  al.  (2013),  the  asymptotic  power  of  the  Kac  Rice  level  a  test  is  a'/i^/(i-i-£) 
and  Type  I  -I-  Type  II  error  =  a  +  1  —  <  1.  As  <5  — )■  1,  we  recover 

the  full  power  in  the  finite  sparsity  case. 

8.1.2.  Group  lasso 

The  authors  are  not  aware  of  literature  on  the  global  power  of  tests  such  as 
the  group  lasso  In  practice,  using  the  group  lasso  penalty  allows  the  modeler 
freedom  to  favor  certain  groups  by  judicious  choices  of  the  group  weights.  Other 
interesting  examples  of  the  group  lasso  such  as  glinternet  (Lim  &  Hastie  2013) 
generally  have  widely  varying  group  sizes. 

Nevertheless,  if  we  are  willing  to  make  simplifying  assumptions  on  group  sizes 
and  the  designs,  then  we  can  say  something  about  power.  The  simplest  thing 
might  be  to  again  assume  orthonormal  design,  with  the  additional  assumption 
of  equal  sized  groups  and  weights.  In  this  very  specialized  setting,  questions 
of  power  reduce  to  questions  about  order  statistics  of  non-central  Xk  random 
variables  for  a  fixed  group  size  k.  These  random  variables  all  have  similar  tail 
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behavior  to  -/V(0, 1)  random  variables,  with  the  effective  number  of  observations 
now  being  n/k.  We  therefore  expect  similar  behavior  to  the  Max  test  if  k  is  fixed. 
Allowing  k  to  vary  with  n,  or  even  be  random  seems  an  interesting  problem 
to  consider,  even  in  the  orthonormal  design  setting.  For  general  X  without 
coherence  assumptions  this  seems  like  a  challenging  problem  indeed. 

8.1.3.  Nuclear  norm 

The  authors  are  not  aware  of  other  approaches  to  testing  the  global  null  using 
the  test  motivated  by  the  nuclear  norm,  based  the  largest  singular  value  of  X'^y. 
The  special  case  Ai  =  7  is  an  obvious  exception,  in  which  this  largest  singular 
value  corresponds  to  the  edge  of  the  spectrum  and  much  is  known  about  the  lim¬ 
iting  distribution,  the  Tracy-Widom  law  Johnstone  (2001).  In  this  case,  our  test 
statistic  is  based  on  the  conditional  distribution  of  Ai  given  A2, . . .  ,Xmin(n,p)- 
Rather  than  begin  a  detailed  analysis  of  the  power,  we  provide  a  small  simula¬ 
tion  study  to  compare  to  existing  implementations  based  on  the  Tracy-Widom 
limit.  The  simulation  is  the  work  of  Yunjin  Choi,  currently  a  Ph.D.  student  in¬ 
vestigating  the  use  of  the  Kac-Rice  pivot  to  inference  in  PCA.  The  example  is  a 
rank-one  example,  demonstrating  that  the  Kac-Rice  test  is  competitive  with  the 
Tracy-Widom  approximation  of  Johnstone  (2001).  The  Kac-Rice  test  has  the 
advantage  of  control  of  Type  I  error  and  applicability  to  the  matrix  completion 
or  reduced-rank  regression  setting. 


8.2.  Related  work  on  selective  inference 


We  conclude  this  paper  with  a  discussion  of  how  results  in  this  paper  can  be 
applied  to  selective  inference  and  how  this  paper  relates  to  other  recent  work. 
An  astute  reader  will  note  that  all  the  distributional  results  in  Section  4  are 
valid  for  any  y,  and  not  just  the  global  null  hypothesis  TJq  :  ^  =  0.  In  this  sense, 
the  results  in  this  paper  go  beyond  just  hypothesis  tests  as  they  can  be  used  to 
construct  intervals  containing  linear  functions  of  the  true  mean  y.  Formally,  they 
can  be  used  to  construct  intervals  for  defined  in  (31).  Specifically,  consider 
the  set 


SI  =\5 


I J  :  min 


’A„.  ,V„. 


,(/,.),l-SA,.,v,.A.^.(/r,*))  >«/2}.  (48) 


where  S  is  defined  in  (41).  As  S  is  an  exact  pivot  for  y^i*  we  see  that 


P  [yr^*  €  SI)  =  1  —  0;. 


(49) 


Applying  the  above  to  the  first  step  of  the  lasso,  we  can  construct  exact 
intervals  that  cover  X'^^^y  where  ATfi)  is  the  first  variable  chosen  by  the  lasso. 
In  work  initiated  after  the  initial  submission  of  this  paper  Lee  et  al.  (2013),  one 
of  the  authors  has  used  this  selective  inference  framework  for  exact  selective 
inference  for  selected  variables  in  the  problem 


.  .  .  1 

minimize  - 

(9  2 


Wy-xur  +  xmu 


(50) 
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Fig  7.  Comparison  of  Kac-Rice  test  to  Tracy  Widom  test  for  {n,p)  =  (50,10).  The 
alternative  is  a  rank  1  matrix  with  singular  value  VSO  *  10  *  0.125  =  7.90.  We  see  that 
the  Tracy  Widom  test  is  more  powerful  here,  probably  because  it  conditions  on  less 
then  the  Kac-Rice  test,  which  conditions  on  all  singular  values  of  the  data  matrix.  See 
the  remarks  following  Lemma  5.  The  Kac-Rice  test  is  applicable  to  situations  beyond 
PC  A  such  as  reduced-rank  regression  and  matrix  completion  while  the  Tracy-Widom 
approximation  is  not,  to  the  authors’  knowledge. 
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for  some  A  >  0  fixed.  With  a  small  modification,  one  can  similarly  analyze  the 
problem 

minimizei||y- .  (51) 

/3:||/3||i<k  Z 

In  the  PCA  setting,  the  above  interval  gives  an  interval  that  covers  a  pseudo¬ 
singular  value  Ui{y)'^ iLVi{y).  If  Ux{y),Vi{y)  recovered  C/i(/x),  Vi(Ai)  without  er¬ 
ror,  then  this  would  be  an  actual  singular  value  and  the  selection  interval  would 
be  a  confidence  interval.  Under  certain  conditions  for  the  mean  matrix  /i,  re¬ 
sults  in  random  matrix  theory  Paul  &  Johnstone  (2012)  ensure  that  the  random 
singular  vectors  converge  in  some  sense  to  the  population  singular  vectors,  in 
the  sense  that  Ui{fj,)'^Ui{y)  — >■  1.  Extending  such  results  to  the  nuclear  norm 
setting  with  X  I  seems  like  an  interesting  and  challenging  problem. 

Perhaps  the  strongest  sense  in  which  these  results  differ  from  earlier  results 
on  selective  inference  is  that  they  are  exact  and  computationally  feasible.  For 
the  lasso,  for  instance,  we  need  only  assume  that  X  has  columns  in  general 
position. 

Another  sense  in  which  these  results  differ  from  existing  results  on  selective 
inference  is  that  our  testing  problem  allows  for  the  possibility  that  the  choice 
is  made  from  some  continuous  set.  In  the  lasso  setting,  a  “mode”  is  chosen  by 
selecting  a  set  of  active  variables  and  their  signs.  This  differs  from  the  group 
lasso  setting  in  the  sense  that  the  active  subgradient  varies  continuously  in  some 
set.  This  added  complexity  makes  conditioning  on  the  “active  subgradient”  a 
somewhat  more  technically  dubious  procedure.  This  conditioning  is  handled 
explicitly  via  the  Kac-Rice  formula  which,  in  this  case,  can  be  thought  of  as 
enabling  us  to  carry  out  a  partition  argument  over  a  smooth  partition  set. 

Finally,  the  construction  of  the  process  is  a  contribution  to  the  theory 
of  smooth  Gaussian  random  fields  as  described  in  Adler  &  Taylor  (2007).  Our 
construction  allows  earlier  proofs  that  apply  only  to  (centered)  Gaussian  random 
fields  of  constant  variance  (i.e.  marginally  stationary)  to  smooth  random  fields 
with  arbitrary  variance.  Even  in  the  marginally  stationary  case,  the  conditional 
distribution  Q  defined  in  (38)  provides  a  new  tool  for  exact  selective  inference 
at  critical  points  of  such  random  fields.  We  leave  this,  and  many  other  topics, 
for  future  work. 
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Appendix  A:  Proofs  and  supplementary  details 
A.l.  Uniqueness  of  the  fitted  values  and  penalty 

Consider  the  strictly  convex  function  f{u)  =  ||y  — ^112/2.  Suppose  Pi, 132  are  two 
solutions  of  the  regularized  problem  (9)  with  Xj3i  Xj32-  Note 

f{Xpi)  +  XP0i)  =  f{Xh)  +  XpCP2)  =  C*, 

the  minimum  possible  value  of  the  criterion  in  (9).  Define  z  =  aPi  +  (1  —  a)/32 
for  any  0  <  a  <  1.  Then  by  strict  convexity  of  /  and  convexity  of  V , 

fiXz)  +  XP{z)  <  afiXpi)  +  (1  -  a)f{XP2)  +  \(ccV0i)  +  (1  -  a)V02)) 

=  (1  —  a)c*  +  ac* 

=  c*, 

contradicting  the  fact  that  c*  is  the  optimal  value  of  the  criterion.  Therefore  we 
conclude  that  XPi  =  X (32- 

Furthermore,  the  uniqueness  of  the  fitted  value  A/3  in  (9)  implies  uniqueness 
of  the  penalty  term  V{(3),  for  any  A  >  0. 


A. 2.  Calculation  of  Ai 


By  the  subgradient  optimality  conditions,  /3  is  a  solution  in  (9)  if  and  only  if 


X^jy-XP) 

X 


e  dViP), 


where  dV{P)  denotes  the  subdifferential  (set  of  subgradients)  of  V  evaluated  at 
p.  Recalling  that  V{P)  =  max„gc'u^/3,  we  can  rewrite  this  as 


,  C  and  =  rm. 

A  A 


(52) 


The  first  statement  above  is  equivalent  to  Q{X'^{y  —  A/3))  <  A,  where  Q{P)  = 
max^gco  u^/3  and  C°  is  the  polar  body  of  C.  Now  define 


Xi  =  Q{x^{i-Pxc^)y), 


where  Pxc^  denotes  projection  onto  the  linear  subspace 

X{v:v  3lC}  =  XC^. 

Then  for  any  A  >  Ai,  the  conditions  in  (52)  are  satisfied  by  taking  A/3  =  Pxc^V^ 
and  accordingly,  V{P)  =  0. 

On  the  other  hand,  if  A  <  Ai,  then  we  must  have  V{i3)  7^  0,  because  V{P)  =  0 
implies  that  A/3  €  AC-*-,  in  fact  XP  =  Pxc^Vt  so  A  <  QiX'^ {I  —  Pxc^)y)  and 
the  first  condition  in  (52)  is  not  satisfied. 


Taylor  et  al. /Inference  via  Kac-Rice 


38 


A. 3.  Bounded  support  function 
We  first  establish  a  technical  lemma. 

Lemma  6.  Suppose  that  D  C  M”  has  0  €  relint(Z?).  Then  for  all  u  €  span(£)), 

max  v'^u  <  M, 

v&D° 

for  some  constant  M  <  oo,  with  D°  =  {w  :  u'^v  <  1  for  all  u  €  I?}  the  polar 
body  of  D.  In  other  words,  D°  n  span(D)  is  a  bounded  set. 

Proof.  By  assumption,  we  know  that  PoBfr)  C  D  for  some  r  >  0,  where  Pd  is 
the  projection  matrix  onto  span(D),  and  B{r)  is  the  £2  ball  of  radius  r  centered 
at  0.  Fix  d  >  0,  and  dehne  Sd^t-^+s  =  dPDB{r~^  +  6),  the  boundary  of  the 
projected  ball  of  radius  r~^  +(5.  Then  for  any  v  G  SD,r-^+s  +  we  can  define 
u  =  rPDv/\\PDv\\2,  and  we  have  u  G  PDB{r)  C  D  with  u^v  =  l  +  dr  >  1,  hence 
V  ^  D° .  As  this  holds  for  all  d  >  0,  we  must  have 

D°CRP\(^\J  SD,r-^+s  +  =  PoBir-^)  +  D^. 

S>0 

This  implies  that  PdD°  C  PDB{r~^),  which  gives  the  result.  □ 

We  now  use  Lemma  6  to  show  that  the  process  in  (16)  is  bounded  over 
7]  G  K,  =  C° ,  assuming  that  C  is  closed  and  contains  0  in  its  relative  interior. 

Lemma  7.  If  C  G  is  a  closed,  convex  set  with  0  G  relint)!!!),  then 

max  /„  <  M, 
r)£K 

where  fr^  is  the  process  in  (16),  K.  =  C° ,  and  M  <  00  is  a  constant. 

Proof.  We  reparametrize  f^  =  {Xr])'^{I  —  Pxc^)y  e^s 

gv  =  v'^{I  -  Pxc^  )y,  vGXK, 

and  apply  Lemma  6  to  the  process  g^,  with  D  =  (A'^)“^C',  the  inverse  image  of 
C  under  the  linear  map  AT^.  It  is  straightforward  to  check,  using  the  fact  that  C 
is  closed,  convex,  and  contains  0  (i.e.,  using  C°°  =  C),  that  D°  =  XC°  =  XX. 
By  Lemma  6,  therefore,  we  know  that  gv  <  M  for  all  v  G  span(Il).  The  proof  is 
completed  by  noting  that  span(I?)  =  span))^^)”^^)  =  (XC-*-)-*-,  which  means 
(I  —  Pxc^)y  €  span(L>)  for  any  y  G  M".  □ 
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